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Abstract. In real-time task scheduling on multiprocessor 
systems, partitioning approach has received the attention of 
many researchers because of its higher least upper bound 
utilization of safe systems. Semi-partitioning allows some 
tasks to be split into subtasks and each subtask to be 
assigned to a different processor. Though task splitting 
improves the performance of systems, by counting each 
subtask as a separate task, it increases the effective number 
of tasks to be scheduled, which in turn, raises the execution 
overhead. This research is on semi-partitioning of tasks and 
assigning each partition to a separate processor to be 
scheduled by the well-known scheduler Rate-Monotonic 
(RM). Using our algorithm, we do not need to define 
release time for subtasks of a task to assure their non- 
concurrent execution and the number of effective tasks, in 
turn, is reduced. It is theoretically proven that with the 
proposed semi-partitioning and RM scheduling algorithm, 
all processors may safely run their tasks according to their 
deadlines. Further, experimental results on 3000 randomly 
generated task-sets indicates that not only is utilization 
factor boosted, but the number of broken tasks also is 
decreased. 


keywords: Rate-Monotonic Least Splitting, Semi- 
Partitioning, Hard real-time, Multiprocessor Scheduling. 


1. Introduction 


Recently, the importance of utilizing embedded 
multicore and multiprocessor system-on-a-chip (MPSoC) 
has begun to appear as a sensible solution for both power 
efficiency and high-performance computing in diverse 
application areas including —_ telecommunications, 
multimedia systems, computer games, space systems, and 
process control to name but a few. The importance of the 
issue becomes more noteworthy when we know that 
traditional computer-based control systems as well as high 
technology systems, require high-performance computers 
for their computations which are usually possible with 
multiprocessor or multicore systems. A multiprocessor 
system is composed of several processing elements, called 
processors, in which all processors can do their processing 
in parallel. If all processors have the same architecture with 
one processor repeated design in the processor, the 
architecture is called homogenous. They all share the same 
main memory, but each can have their own private cache 
memory. With this structure, a sequential computation 
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can be shared among many processors if not more than one 
processor is executing the computation, simultaneously [1]. 
While manufacturers tend to use multiprocessors in new 
devices, the development of software facilities which use 
all available power of multiprocessors [2] is required. In 
this context, scheduling algorithms play a prominent role in 
safeness verification of hard real-time systems, i.e., making 
sure that every request is executed before its deadline. The 
problem becomes more challenging when there is more 
than one processor involved so that 
multiprocessor/multicore systems adds a new dimension to 
the analysis: how to assign tasks or their requests for 
different processors. Therefore, there are, overall, two key 
issues which are still open in multiprocessor scheduling as 
follows: 

1. Task assignment to processors: Finding a target 
processor to run every selected task; 

2. Identifying tasks priority: Making an appropriate order of 
priority to run tasks. 

In this article, the problem of scheduling periodic hard 
real-time task sets with implicit deadlines on 
multiprocessors is investigated. What we mean by the 
implicit deadline is that the deadline of a request is the 
exact time when the next request arrives from the same 
task. Many other researchers have studied the same 
problem but new ideas have kept this practical area 
dynamic, attractive, and improving. 

From the perspective of task migration, task assignment 
issue on a multiprocessor, generally, fall into two main 
categories: 

1. Global: These algorithms are employed where tasks are 
allowed to migrate from one processor/core to another. 

2. Partitioned: These algorithms are employed where tasks 
are not allowed to migrate from one processor/core to 
another. 

Moreover, there is another important category which is 
the hybrid of the above methods called semi-partitioned. 

In global scheduling, there is only one queue (or pool) of 
real-time requests and each processor takes its next 
execution request from this queue. Regardless of its 
advantages, in complex and large systems, the overheads of 
employing a single global queue would become too 
excessive. In partitioned approach, on the other hand, the 
set of tasks are divided and each partition is assigned to a 
separate processor. Finally, in semi-partitioned, some tasks 
are solely assigned to one processor and some tasks are 
shared among two or more processors, with the restriction 
that not more than one processor can work on a request for 
the shared task, simultaneously. Although the scheduler 
may be different for each of the three categories of, in 
almost all cases, the scheduler of all processors is 
considered to be the same. 
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While global scheduling techniques, depending on the 
hardware architecture, potentially could have very high 
overheads owing to the fact that the task migration from 
one processor to another usually yields communication and 
cache missing cost; however, in fully partitioned techniques 
this does not occur. Nonetheless, non-migration techniques 
often do not use all available processor capacity. Moreover, 
it is a widely occurring situation where the total unused 
capacity may be more than task utilization but no single 
processor has enough capacity to schedule the task. To 
solve those problems, a hybrid technique (called semi- 
partitioned) is used which combines elements of global and 
partitioned techniques. 

It is usually the case that semi-partitioned scheduling 
leads to a higher overall utilization of the whole system as 
compared to either of partitioned and global scheduling, for 
both fixed-priority and dynamic priority. Further, 
partitioning is a time-consuming task which is 
computationally equivalent to bin-packing problem that is 
known to be an NP-hard problem [3]. On the positive side, 
partitioning is done off-line. Therefore, for a small number 
of tasks, the time utilized for partitioning is tolerable but, 
for a large number of tasks efficient heuristics are used. 
Although an optimal semi-partitioned method has not yet 
been developed, many heuristic algorithms are presented by 
researchers. A semi-partitioned approach binds a disjoint 
set of whole tasks to each processor and allows the 
remaining tasks to be executed on multiple processors 
while all shared qualities are defined. In one of the 
researches on semi-partitioned methods in which 
Rate-Monotonic (RM) scheduler is used in each processor, 
worst-case utilization is reported to be In (2) = 0.693 [4]. 

In this article, a novel semi-partitioned scheduling 
algorithm called Rate-Monotonic Least Splitting (RMLS) is 
proposed for multiprocessors. The scheduler of each 
processor is RM with provisions to avoid simultaneous 
execution of a shared task by more than one processor. 
Using this algorithm, we see that the number of split tasks 
at most is equal to the number of used processors minus 
one. However, the actual number of split tasks might even 
be lower. Besides, no task is split into more than two 
subtasks. Splitting into fewer numbers of tasks has two 
benefits: 

1. Effective number of tasks in the Liu and Leyland’s 
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bound is reduced: O(n) = n(2n — 1) 
2. It increases overall system utilization. 

The remainder of this article is divided into seven 
sections: We firstly present our system model and notations 
in Section 2. In Section 3, related works are briefly 
reviewed. Section 4, describes the proposed RMLS semi- 
partitioned scheduling. Section 5, is the theoretical 
foundations and safeness proof of the algorithm; in Section 
6, the algorithm is simulated and results are documented, 
and finally a summary and future work are presented in 
Section 7. 


2. System Model and Notations 


We consider a system with m symmetric processors and 
the main aim of this article is to schedule a bag-of-tasks 
containing periodic hard real-time tasks with an implicit 
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deadline in such a system. A very important assumption is 
that all tasks are synchronous. That is, the deadline of any 
arbitrary task request is the exact time when the next 
request of that task arrives. We have also further assumed 
that all tasks are preemptive. From now on, in order to 
avoid potential difficulty in naming, we simplify call it task 
in this article. 
The following notations are used throughout the article: 
. n: total number of tasks 
. nı: total number of tasks and subtasks 
. m: total number of available or processors 
. mı: total number of used processors 
Ti i" task 
. T: minimum time between any two consecutive requests 
of task 7;, i.e. minimum request interval of 7; 
7. Ci: worst case computation time needed by every request 
of task 7;. It is clear that C; < T; 
8. T= {t,,7,,...,7,}: Set of all tasks (bag-of-task) 


9. p ={P,,P,,...,P.,}: set of all processors 


NnBWN 


ati ne Ci 
10. ui: the utilization of task t; which is equal to u; = = 
1 


11. Uí): total utilization of task-set I and subtasks 


To evaluate the performance of scheduling heuristics, 
there are many theoretical factors such as utilization bound, 
speedup factor and many others. In addition to those 
factors, there are many articles which have used empirical 
methods to show and compare the relative performance of 
different algorithms. The most widely accepted are the 
number of randomly generated task-sets. Many different 
algorithms are now available to generate random task-sets 
and evaluate performance, but they are almost similar to 
one other in using parameters such as the number of total 
processors, task-set utilization, the number of all tasks in 
the task set, the distribution of task deadlines, the range of 
task periods, and the distribution of every task utilization. 
More details about the used performance evaluation 
algorithm in the present article and the number of randomly 
generated task-sets are provided in Section 6. 


3. Related Works 


Many studies have been conducted in the domain of 
homogeneous multiprocessor and multicore systems since 
the 1960s. Among those, semi-partitioned approaches, 
which try to merge and employ the best attributes of global 
and fully partitioned approaches, have a prominent role 
because of solving the task allocation problem which is 
analogous to the bin-packing problem. Anderson and Tovar 
[5], in one of the first studies in this respect, proposed an 
approach for scheduling hard real-time periodic tasks with 
implicit deadlines called EKG. In this approach, the 
parameter k has been used to control the splitting of tasks 
into the light and heavy sets. This suggestion is to set k=2 
which yields the utilization bound of 66% and, on average, 
at most four preemptions per every task over the hyper 
period; yet, k=m gives the utilization bound of 100% and 
2k preemption for each task. In this context, many 
researchers have proposed the semi-partitioned problem 
with Earliest Deadline First (EDF) scheduling [6], [7]. The 
best known worst-case utilization bound known using semi- 
partitioned EDF scheduling on multicores is 65% for 
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Earliest Deadline Deferrable Portion (EDDP) algorithm [8]. 
EDDP distinguishes between heavy tasks (those whose 
utilization are greater than 65%) and light tasks (other tasks 
with utilization < 65%) in such a way that the algorithm 
firstly assigns every heavy task to its own processor and 
then light tasks are placed on the remaining processors. 
They showed that 65% is a safe utilization bound for EDDP 
provided tasks which are periodic with implicit deadlines. 
Later, in 2009, they proposed EDF with Window- 
constraint Migration (EDF-WM) which has less context 
switching overhead [9]. 

On the other hand, relatively fewer algorithms are 
proposed for fixed-priority algorithms [10]. Rate 
Monotonic Deferrable Portion (RMDP) and Deadline 
Monotonic with Priority Migration (DM-PM) fixed-priority 
algorithms are proposed by Kato et al. [11, 12]. The worst- 
case utilization bound of those algorithms is shown to be 
50%. RMDP consists of two phases: task assigning and task 
scheduling. The first phase is very simple: Sort all tasks in 
ascending order of 7;. Then, assign tasks to processors and 
if assigning a task causes the processor utilization bound to 
exceed, split the task into two subtasks. Sub-task 1 is placed 
on the current processor and sub-task 2 is assigned to the 
next processor. Task scheduling phase utilizes RM 
scheduler algorithm in every processor considering the fact 
that the second portion of a ready task in a processor cannot 
start to run until its first portion finishes execution. 

PDMS _HPTS DS is proposed by Lakshmanan et al. [2] 
which achieves the utilization bound of at least 60% 
providing tasks have an implicit deadline. It can, 
nevertheless, increase up to 65% if tasks are assigned to 
processors in order of decreasing utilization. Moreover, this 
bound can be extended to 69.3% for light tasks, i.e., tasks 
with utilizations less than 0.41. Guan et al. have proposed 
two algorithms which they called SPA1 and SPA2 [4], [10]. 
SPA2 has a pre-assignment phase in which special heavy 
tasks are first assigned to a separate processor. The 
advantages of this method are that the number of split tasks 
is m-l] and SPA2 reaches the worst-case utilization bound 
of 0.693. This is equal to Liu and Leyland’s bound [13] for 
single processor systems The disadvantage includes, the 
worst-case bound in SPA2 is calculated using n which is 
the cardinality of the whole task-set, and every processor’s 
utilization must be less than or equal to that. For example, 
although Liu and Leyland’s least upper bound utilization 
for a two-task processor is approximately 0.83, but with 
SPA2 its utilization should not exceed 0.693. With this 
explanation, the claim that SPA2 has reached Liu and 
Leyland’s utilization bound, does not seem to be entirely 
correct. 

For supplementary information on hard real-time task 
scheduling algorithms and related issues, the reader is 
invited to refer to [14]. 


4. Rate Monotonic Least Splitting 


The basic idea of the semi-partitioned method, which is 
being presented here, has been published in an in-progress 
research workshop [15]. In that paper, the fundamental 
theorem which shows the safeness of system was not 
proven. In addition, none of the other theoretical results 
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provided by this article have appeared in that paper. Now, a 
brief introduction of the method is given here and new 
findings and performance evaluations follow. The method 
is called Rate-Monotonic Least Splitting (RMLS) because it 
is a semi-partitioned method in which at most m;-] tasks 
are split and at the same time, to the best of our knowledge, 
the method of partitioning and task splitting presented here 
is very simple, comprehensible and it is shown to be 
efficient. 

Our experiments show that the achieved processor 
utilization is approximately 9.6% higher than the best- 
known results for general real-time systems, i.e., no 
restrictions on utilization of individual tasks, up to now. 

Having taken all the above explanations into account, in 
the following sub-section, we first propose our novel idea 
to demonstrate how tasks are placed on processors. Then, in 
the next sub-sections, we will have a fairly good discussion 
on the computational complexity of the RMLS algorithm. 


4.1. Task Assignment 


The proposed assignment algorithm is precisely outlined 
in Fig. 1 and we elaborate on it in the two super steps as 
follows: 

Step 1: Selection of single tasks and pairs of tasks to assign 
each one to a separate processor (lines 4 to 25). 

Step 2: Assignment of remaining tasks to remaining 
processors (lines 28 to 47). 

In the first step, tasks are sorted in descending order of 
their utilizations and the result is saved as a sorted task-set. 
A greedy approach is followed to find single tasks or pairs 
of tasks which can be assigned to separate processors to 
which other tasks will not be assigned to. To such 
processors, no subtasks will be assigned. In this phase, two 
pointers, i and j, are set to the beginning and the end of the 
set, respectively. If © (3) < u; + uj < 1 then these two tasks 
make a pair (lines 7 to 11) which is assigned to a separate 
processor; otherwise, one of these tasks is removed from 
the set (lines 18 to 22) and the process continues until the 
two pointers pass each other. The removed task will be 
scheduled in the second step. Step 1, continues by 
recognizing heavy tasks, i.e., a task t; with u; > 6 (2), and 
such task is assigned to a separate processor (lines 13 to 
16.) No subtasks will be assigned to these processors. 

To get the highest possible utilization, the scheduler of 
each processor with two tasks is a modified version of RM 
named Delayed Rate Monotonic (DRM) [16]. Suppose the 
two tasks t; = (C;, Tj) and 1; = (Cj, T;), are solely assigned to 
processor pg which has no other tasks. The delay time of 
each request of task t;, whose priority is higher than 1; with 
respect to RM, is taken to be equal to T;-C; while there is no 
delay for requests of task tj. Here, you can consider that the 
delay is similar to (but not exactly the same as) ready time. 

If the ready time of a request by 1; is T;-C; and the request 
arrives at time t, then it would not be possible to start 
executing this request until time t+T;-C;. On the other hand, 
the delay of task qt; is terminated at any time there is no 
pending request from task qj, and it will not re-enter delay 
state even if a new request arrives from qj. In addition, the 
delay of any request from this task can also end when T;-C; 
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time units have elapsed from the time that request is 
generated. 

This modification will guarantee that if the total 
utilization of the two tasks 1; and 1;, assigned to a separate 
processor is less than or equal to one, then the processor 
will always run safely [17]. 


Data: Task-set T //Includes Ti , Ci for each task i 
Result: Packing p 


1 p— 
2 k—0 
3 r'r 
4 Sort T" in descending order of tasks’ utilization 
5 ic1, jlength(T) 
6 while i<j do 
7 if O3)suy + uysl then 
8 kek+1 
9 Add Pk to p 
0 Move qi and t; from T to Pk 
1 icit] ; jj-1 
2 else 
3 if O(2)suy then 
4 kek+1 
5 Add Pk to p 
6 Move 7; from T to Pk 
7 else 
8 if O(3)suy + uy then 
9 i—i+1 
20 else 
21 jc-l 
22 end 
23 end 
24 end 
25 end 
26 kek+1 


27 Add Pk to p 
28 while [4#@ do 


29 Tithe task with the highest priority in T 
30 if UP) = OCIpxd) then 
31 kek+1 
32 Add Pk to p 
33 end 
34 if U(p,x) + u; < O(lpgl| + 1) then 
35 Move t; from T to Pk 
36 else 
37 if U(p,) < O(|p,| + 1) then 
38 Select tj from T where 
uy, + Ux) < O(|pxl + 2) 
39 Move 1; from I to Pk 
40 Split t; into T; and Tiz such that 
41 Ura = Opel + 1) — Ux) 
42 Replace q; in T by Tiz with u,,, = Sie 
we TRG 
43 Move Tj; to Pk 
44 k—k+1 
45 Add Pk to p 
46 end 
47 end 
48 end 
49 mlk //number of used processors 


Fig. 1. The packing algorithm 


The scheduler of all processors, except those which has 
two tasks, is the traditional RM without any delay or ready 
time for requests. 

Step 1, serves two sole purposes: (1) It increases the 
number of processors with higher utilization than those 
processors which are assigned tasks in Step 2, and (2) It 
increases the number of processors with no split task and 
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hence, it decreases the total number of tasks which will be 
split in Step 2. Thus, by reducing the effective number of 
tasks (the total number of tasks and subtasks) the intuition 
is that there would be less number of tasks preemptions 
during run time. 

In Step 2, all unassigned tasks will be sorted in 
decreasing order of RM priorities, i.e., the non-descending 
order of their request interval lengths. An empty processor 
is selected and then an unassigned task is selected from the 
sorted list to assign to the selected empty processor. This 
scenario will continue to repeat itself until the current task, 
say task t, will overload the processor (lines 29 to 35). 
Then a search amongst the remaining unassigned tasks 
must be done to find a task with maximum utilization 
which can be assigned to this processor without 
overloading it. If one is found, it will be assigned to the 
processor. If this processor is not filled, task 7; is then split 
into two subtasks so that the first subtask is assigned to the 
current processor and makes it full with respect to Liu and 
Layland’s bound for the respective number of tasks and 
subtasks in this processor (lines 37 to 44). 

To clarify, suppose that the current processor is p, and 
task q; is the task which is split into two subtasks ti; and Ti 
with execution times Cj, and Cj, respectively. The 


boar . Ci 
utilization of ti; is Uj, = m for processor py. A new 
i 


processor, px+1, is picked up and the second part of task qti 


which was split, i.e., Ti, is assigned to this processor. 
Although the actual utilization of this subtask is —4, its 


1 


effective utilization on processor pk is taken to be: 


MTEC, (1) 


The effective utilization of this subtask is greater than its 
ra ani Ci Ci : 
actual utilization, i.e. —2- > =. Therefore, the difference 


i” Mi1 i 

of these two values is what we have to sacrifice because 
there may be some situations in which both processors that 
share task t; want to execute this task but the only processor 
that can run it at this time, is the processor whose index is 
the lower. In Lemma 3, we will prove that, in the worst 
case, the second part of a request from task t; will have a 
time length of T;-C;;, not T;, to be executed. As mentioned 
earlier, this is due to the interference between the two 
processors that share task T;. 

For example, suppose tasks t)= (1.1, 4), m= (3, 17), and 
T= (3.2, 18) are completely assigned to processor pı and 
task t4 = (6.55, 20) is broken into two subtasks ty., = (2.55, 
20) and t4 2 = (4, 20) which are assigned to processors pı 
and p2, respectively. Besides, tasks ts = (5, 25) and t6 = (6, 
30) are completely assigned to processor p and from task 
T7 = (7, 42) the subtask t7 ; = (5.65, 42) is assigned to 
processor p>. The rest of task 17, i.e., t7 2 = (1.35, 42) and 
task tg = (47.4, 60) are assigned to processor p3. The total 
utilization of these three processors are computed as follow: 


z Ld 3 32,255 0.7567 

1> 4'17 18° 20 7 f 

gee ee OEE ry 

2" 20—255 25 30° 42 ` i 
1.35 47.4 

U; = 42—565 + So = 0.8271 
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In RMLS algorithm outlined in Fig. 1, the process of 
assigning tasks to the processors continues until all tasks 
are assigned. If processors are exhausted but some 
unassigned tasks still remained, the assignment is 
unsuccessful; otherwise, it is successful. 

Suppose the assignment is successful, RMLS splits at the 
most m,-1 tasks, where m, is the actual number of used 
processors. There is no release time or delay time for the 
tasks that are assigned in Step 2 and thus, the scheduler is 
the traditional RM with a minor amendment. Obviously, it 
is clear that only one processor can execute a sequential 
task at any given time. In order to make sure this vital 
condition is observed, whenever there is a conflict, the 
processor with the lower index must have the precedence in 
executing the shared request. That is, under RMLS, the 
execution of that portion of the split task which is assigned 
to the lower indexed processor is not affected by the 
execution of that portion of the split task which is assigned 
to the higher index processor. However, the execution of 
that portion of the split task which is assigned to the higher 
indexed processor may be delayed because the lower 
indexed processor is running the split task. In other words, 
the lower indexed processor can run its own portion of the 
split task whenever it desires to, but the higher indexed 
processor can only run its own portion if the lowered 
indexed processor is not running its portion of the split task. 

Using Equation (1) in computing the total utilization of 
processor px+; will reduce the actual sum of the task 
utilization on processor p+; to less than Liu & Layland’s 
bound. However, this is an unavoidable cost that we have to 
pay for all processors to run safely. 

On the positive side, by using RMLS, there is no need to 
restrict the sum of utilizations of all processors to be less 
than or equal to O(n) (where n; is the total number of tasks 
and subtasks of the whole system after partitioning and 
complete assignment). That is, the total utilization could be 
more than 69.3% and the system is still in a safe state for 
any arbitrary number of processors (e.g. for a large number 
of processors). 


4.2. Computational Complexity 


To calculate computational complexity, we divide the 
scheduler into three steps. The task-set is sorted in the first 
step; thus, for a task-set of size n, the computational 
complexity of the first step would be O(nlog n). In the 
second step, a search for large tasks and pairs of tasks is 
done. This step is preceded by a loop. This loop continues 
until i and j variables, which respectively started from the 
beginning and the end of the tasks list, become equal. Since 
at least one of those variables is changed during each 
iteration, this loop iterates at most n times. Thus time 
complexity of the second step is to O(n). The third step 
assigns the remaining tasks (at most n tasks) to processors. 
This step contains a loop, which iterates one time for each 
task (either it is split or not). However, before every task 
splitting, a search must be conducted among unscheduled 
tasks. This search might find a small task to put in the 
current processor. As this procedure is done for every 
processor, the total time complexity of the third step would 
be O(mn), in which m is the number of processors used. 
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The total time complexity of the scheduling algorithm 
can be calculated by sum of complexities of steps 1 to 3, 
which is 
O(max(n log n, mn)). 


5. Safeness Verification of RMLS 


The great advantage of RMLS is that Liu & Layland’s 
bound is only computed based on the total number of tasks 
and subtasks assigned to every processor separately and 
that is why well-known similar previous researches could 
not reach such a high degree of freedom. As a fine example, 
Guan et al. [4] proposed a semi-partitioned algorithm with 
this additional constraint that the total utilization of all tasks 
coupled with subtasks must not exceed Liu & Layland’s 
bound, whereas we successfully removed this restriction in 
our algorithm RMLS. We first present three lemmas and 
then prove the claimed statement. 

In the rest of this article, it is assumed that two 
processors px and pk share a task qt; = (T;, C;) and for each 
request of the common task Ci; is executed by py and Ci is 
executed by pk so that C=C;,+Cj2. In addition, effective 
utilization of the second part of a shared task is used as its 
utilization in the corresponding processor. 


Lemma 1. 


If Liu & Layland’s bound, is satisfied by all processors, 
the second part of a request from a shared task, ti, between 
two processors, pand p+, never overruns. 


Proof. 


The preference of executing a request from the shared 
task t; between processors pk and px; is given to px. 
Furthermore, the second part of a request from task t; has 
the highest priority within all tasks in px. Therefore, as 
soon as a request from task qt; is generated, its execution 
starts by either px or px; and continues executing 
(migrating between the processors, if necessary) until the 
second part of the task is completed. Therefore, in the worst 
case scenario, the execution of the second part of the task 
will be completed after a time length of C; is passed from 
its request (C;< T)). 


Definition 1. 


A conflict-idle period is a time interval in which both 
processors, pk and pk, that share the shared task ti, want to 
run a request from the task, but because px is given a higher 
precedence, it will proceed with its execution; and at the 
same time, there is no other pending requests for processor 
Px+1 Within this period and it will be idle. Note that, not all 
conflict periods of processors pg and px+; are necessarily 
conflict-idle because if there are other requests for p,+; then 
it will proceed with their execution and hence, it will 
therefore not be idle. 

Consider a situation in which the task qt; is split into 
subtasks 1;, and Tip, and they are assigned to processors pk 
and Pkn, respectively. Subtask t; is the task with the lowest 
priority (or in some cases the second lowest priority) within 
processor p, while task tip is always the task with the 
highest priority among all tasks and subtasks assigned to 
processor pk. This decreases the chance of encountering a 


48 Sedaghat et. al.: Safety Verification of Rate-Monotonic Least-Splitting Real-Time ... 


situation in which both processors that want to 
simultaneously run the task 1;; however, it is not zero. 
Therefore, conflict periods are very rare and as a result, 
seldom will conflict-idle periods to take place. 


Lemma 2. 


Suppose two processors pk and pk share a task t; and 
run ng and ny; tasks while their total utilization is not 
greater than O(n,) and O(n), respectively. If there will 
not be any conflict-idle period with respect to 1;, then both 
processors will always run overrun-free. 


Proof. 


Since processor p, has a higher precedence to run q; than 
Pk, this processor will always run overrun-free. On the 
other hand, the only effect that p, can have on the execution 
of tasks of processor p+; is that it may postpone the 
execution of the second part of a request from the shared 
task. This may harm the overrun-freeness of the shared task 
in px+; but it can be beneficial to the other tasks of this 
processor. However, in Lemma 1, it was proven that the 
second part of a request from a shared task never overruns. 
Therefore, this processor runs overrun-free as well. 


Lemmas 1 and 2, will hold even if actual utilization of 
subtask Tj, 1.€., = , is used for computing the utilization of 
1 
Pki. It is for compensation of possible conflict-idle periods 
that, in general, effective utilization of the shared task on 
Ciz 
Ti- Cin 


processor p+; is computed as 


Definition 2. 


The remaining utilization of a request (not a task or 
subtask) at a given time is defined to be its remaining 
execution time divided by its remaining time to reach the 
deadline. At the exact time when a request is generated its 
remaining utilization is equal to its actual utilization. 
However, as time passes, its remaining utilization may 
fluctuate depending on how much time has been passed 
from its request time and how much it has been executed 
until the time that the remaining utilization is computed. 

For example, suppose task t = (10, 4) has generated a 
request at time 20 and the current time is 26 and up to now, 
this request has received 1.5 unit of CPU time. The 
remaining utilization of the request at time 26 is 
(4 — 1.5)/(30 — 26) = 0.625. 


Ti IN Px 


Ti2 1N Pett 


Fig. 2. A Sample execution of parts of a split task 


Lemma 3. 


Suppose two processors px and pk share a task ti. The 
remaining utilization of a request from 1; for processor P+ 
is maximal at the exact time when the execution of 
processor p,;’s share of this request is completed when pk 
starts this request immediately after it is generated and 
continues running it until its share is completed. 


Proof. 


Suppose as soon as a request from 1; is generated at a 
time to, processor pp starts to execute it until its share is 
finished at time to+C;,. At this time, effective utilization of 


“2_ We show that this 
Ty-Cjy 
is, in fact, the maximal effective utilization of tip, which 
means subtask 1j2’s effective utilization never becomes 
greater than this value. It is worth mentioning to recall that 
requests of task t; have the highest priority in processor 
Pk. This implies that any request for this task will be 
immediately picked up for execution by p,,; if pk is not 
executing it. 

On the other hand, if the execution of the second part of a 
request of task t; is completed by processor pk then its 
remaining utilization becomes zero and remains zero until a 
new request is generated from the same task. 

With these points in mind, consider a situation where at 
any time t; (to < tı < to+C;), processor px has executed this 
request for the duration of length an (a < C;,), and processor 
Pk has executed the same request for duration b (b < Ciz 
and a+b = ti-to). This example is illustrated graphically in 


the subtask tiz on Pı is equal to 


Fig. 2. 
At time t; effective utilization of Tiz is: 
Ciz — b 
Ti = (a + b) 
Since a < Cy, 
Cj2—b Ci2—b Cj2—b 
Ti-(at+b) ~ Ty-(Cji+b) = Ty-Cjy—-b 


To show that the maximal effective utilization of tz is 
Ge. 
=. it has to be shown that: 


Ti—Ci1 
Ciz-b_ < _Cig 

Ty-Cjy-b ~ Ti-Ci4 
That is, 

(Ciz — b)(T; — Cia) < Giz (T; = Cia — b) 
Or, 

—bT; + bCi < —bCji2 

Or, 


b(Ci + Ci2) < bT; 


Which is always true because b is positive andC;, + 

We have now provided a solid foundation of what must 
be considered to prove the safeness of multiprocessor 
systems to satisfy the requirements verbalized by RMLS 
scheduling algorithm. 


Theorem 1. 


If effective utilization of each of two processors p, and 
Pk Which share a task qti, is not greater than Liu and 
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Layland’s bound, then both processors will always safely 
run their corresponding tasks and subtasks. 


Proof. 


This theorem is similar to Lemma 2 in which it is 
assumed that there will be no conflict-idle period. However, 
here, this restriction is removed. In Lemma 2, it is 
mentioned that processor p+ does not have any influence 
on the execution of tasks and subtasks assigned to processor 
Pk- Since Liu and Layland’s bound is satisfied for px it will 
always safely run its assigned tasks and subtask. In the 
packing algorithm (Fig. 1), the utilization of the shared task 


Ciz 


on processor pk is computed as which, based on 


i> Cin 
Lemma 3 and is the maximum utilization that tiz can always 
impose on the processor px-+1. 

On the other hand, the utilization of processor is taken to 
be less than or equal Liu and Layland’s bound. Therefore, 
this processor will always safely run its assigned tasks and 
subtask, as well. 

In each processor pp (k=1,2,3,.., mı), at most, there is 
only one task which is shared with the processor px.; (if 
k>2) and one task which is shared with processor px; (if 
k<m)). Using Theorem 1 twice, once for px-; and pk and 
once for pk and Pk, we can conclude that all processors 
would be safe with RMLS. 


6. Simulations 


Despite the fact that theoretical results such as speedup 
factors, play a prominent role to verify the schedulability 
performance, over the last years, many different empirical 
studies have also aimed to investigate the relative 
schedulability test performance amongst many different 
scheduling algorithms in the domain of real-time systems. 
Empirical studies also provide more general schedulability 
tests by focusing more on individual tasks parameters 
which yield to take into account non-specific task-sets. 
Task parameters such as the number of processors and 
tasks, the total utilization of the task-set, task deadline 
distribution and period and many others, would be very 
important especially for those techniques that will be 
employed in safety-critical industrial environments. 

In the current section, the proposed method is compared 
with SPA2 [10]. We used UUnifast algorithm, a de facto 
standard proposed by Bini and Buttazo [18], to produce 
random task-sets without any bias. It is given that the n 
tasks' utilization (as random variables) are uniformly 
distributed between 0 and 1, and the sum of them is a 
specific value which is the total utilization of the system. 
The UUnifast algorithm uses the cumulative distribution 
function and generates task-sets with uniform distribution 
in O(n) time order. For further reading please refer to [18], 
[19]. 


6.1. Comparisons 


To compare the results of packing, we used the method 
used by Burns et al [7]. In this method, task-sets are divided 
into some categories. For each category of task-sets, the 
result mean of that category is selected for comparison 
purposes. Task-sets in each category have the same overall 
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utilizations and the same number of tasks. For example, 
there are 200 task-sets in the first category, with their 
overall utilization is equal to 4, and for this category, there 
are 16 tasks in each task-set. 

Experiments were performed on 3000 randomly 
generated task-sets with a different number of tasks and 
different overall utilizations. Overall utilizations used are 4, 
8 and 16. The number of tasks tested with each utilization is 
shown in Figures 3 to 5. For example, we compared the 
three methods SPA2, RMLS, and RMLS+DRM with task- 
sets and with the overall utilization of 8 so that task-sets 
contain 16, 20, 28, 44, and 76 tasks. 

We allow RMLS algorithm to assign processors as 
needed and, for the SPA2 algorithm, we initially start with 
a high number of processors with which we are assured of 
the safety of the system. Then, we gradually reduce the 
number of processors one at a time until reaching the 
lowest number of processors in which the system is still in a 
safe mode. When the minimum number of processors 
needed for each method is found, the average utilization of 
all processors is calculated by dividing the overall 
utilization of each task set by the number of processors 
simply. 

The primitive version of RMLS (represented by PRMLS 
in Figures 3 to 5) uses RM scheduler in all processors. 
Thus, the utilization of systems containing two processors 
must not exceed the higher bound of © (2). Moreover, in 
that version, single tasks whose utilizations are greater than 
or equal to 0.83, were not separated to schedule each one on 
a single processor. The partitioning algorithm of PRMLS is 
a simplified version of the algorithm outlined in Fig. 1, in 
which all activities concerning Step 1 is removed. We 
would now like to compare primitive RMLS with SPA2 
and RMDP. 

The median utilization of SPA2 is either the same or 
lower than PRMLS. However, for equal medians, error bars 
indicate that the efficiency of PRMLS partitioning is better 
than SPA2. The average utilizations achieved for whole 
task-sets are 0.680 and 0.735 for SPA2 and RMLS 
respectively. This shows that the overall performance of 
Primitive RMLS (PRMLS) is more than 8% higher than 
that of SPA2. 
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Fig. 3. Median of performance, by each method, for U=4 
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Fig. 4. Median of performance, by each method, for U=8 
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Fig. 5. Median of performance, by each method, for U=16 
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Fig. 6. Success rate for each method, for 3000 task sets 


6.2. Discussion 


Our experiments reveal that using DRM for two-task 
processors, greatly improvement the overall utilization of 
processors. The average utilization for RMLS on the 
randomly generated 3000 task-sets was 0.776. Medians, 25 
and 75 percentiles are shown in Figures 3 to 5. An overall 
improvement of more than 14% as compared to SPA2 is a 
remarkable achievement for RMLS. 

By comparing RMLS and PRMLS, one gets the 
impression that a little change can have a great performance 
improvement (i.e. close to 6%). Experiments, however, 
show better performance for RMDP method, which is 
caused by its scheduler. The scheduler of the RMDP 
method differs from rate-monotonic and, in actual fact, it is 


really more complex and involved; thus, its better 
performance is expected. 

Although one can infer from Figures 3 to 5, that SPA2 
usually should use a higher number of processors to safely 
schedule the same set of tasks as compared to both of 
PRMLS and RMLS, some complement charts are provided 
for visual comparisons (see Figures 7 to 9). 

In addition, another experiment was conducted to test the 
schedulability of RMLS, PRMLS, SPA2, and RMDP on 
total 3000 randomly generated task-sets (see Fig. 6). In this 
experiment, we set different system utilizations and 
measure the ratio of task-sets that are schedulable. Taking a 
quick look at Figure 6, we clearly see that for all task-set 
with total utilization U(T) <= 0.66, all four algorithms can 
schedule every task-set with success ratio 1; however, for 
task-sets with total utilization greater than 0.66, SPA2 and 
PRMLS show their downside so that system total utilization 
of below 0.7, for instance, only 9% of randomly generated 
task-sets are schedulable using SPA2 whereas RMLS 
schedules about 79% of task-sets safely. 

As was mentioned before, the high ability of RMDP to 
schedule task-sets is owing to its complicated algorithm. 
Additionally, please note that RMDP scheduler policy does 
not rate-monotonic. 
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Fig. 7. Number of used processors for each method, for U=4 
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Fig. 8. Number of processors used for each method, for U=8 


Journal of Computer and Knowledge Engineering, Vol. 1, No. 1 


25 : . : - 


20 4 


15 q 


Average number of processors 


67 106 183 
Number of tasks 


Fig. 9. Number of processors used for each method, for U=16 


7. Summary and Future Work 


Significant advances which have been made in many 
industrial areas are good evidence to support this claim that 
the importance of utilizing embedded multiprocessor 
system-on-a-chip (MPSoC) is an undeniable fact. In this 
context, regardless of many problems that must be 
considered, task management is a key issue which we 
focused on in the current article. Out of different 
approaches for hard real-time task scheduling, semi- 
partitioning of periodic tasks on multiprocessors was 
studied here in which the scheduler of each processor is rate 
monotonic, with the exception that the scheduler of 
processors with exactly two whole tasks is delayed rate 
monotonic (DRM) [16]. It was proven that when a task is 
split between two processors, if the utilization of the second 
part of the task is considered a little higher than its actual 
utilization, and RM scheduling policy is used on both 
processors while Liu and Layland’s bound is satisfied, both 
processors run safely and all tasks meet their deadlines. 

With this method, there is no need to define a release 
time for the second subtask. The Rate-Monotonic Least 
Splitting (RMLS) algorithm was developed and its 
performance was compared with the SPA2 algorithm as 
well as RMDP. It was concluded that the performance of 
PRMLS (Primitive RMLS) is more than 8% higher than 
SPA2 and the performance of RMLS is more than 14% 
higher than that of SPA2. This means that both PRMLS and 
RMLS usually need a fewer number of processors to safely 
schedule the same set of real-time tasks than SPA2, using a 
semi-rate-monotonic scheduler. 

Although many types of research in the domain of semi- 
partitioned scheduling are being conducted, the use of new 
methods seems to be very important to improve real-time 
scheduling performance on multiprocessors. For example, 
authors in [20] did employ the equation of the line to 
dynamically assign priority to the tasks (called LTS) which 
appear to be an interestingly novel method in global 
multiprocessor scheduling. They claimed that their 
algorithm schedules all periodic task sets with total 
utilization up to 100% safely. One can be to modify (and 
improve) the LTS algorithm so that it is possible to use in 
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semi-partitioned multiprocessor scheduling (e.g. the policy 
of each processor scheduler). 
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